
Basic Data Exploration

Kaggle Intro Machine Learning

Using Pandas to Get Familiar With Your Data

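The first step in any machine learning project is to get familiar with your data. You'll use the Pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as pd. We do this with the command: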

python
import pandas as pd	

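The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database.

Pandas has powerful methods for most things you'll want to do with this type of data.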
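
To make the table analogy concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The name `homes` and all of its values are invented for illustration; they are not taken from the Melbourne or Iowa files.

```python
import pandas as pd

# Hypothetical values, invented purely for illustration.
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [650000, 903000, 1330000],
})

print(homes.shape)             # (number of rows, number of columns)
print(homes.columns.tolist())  # column labels
print(homes["Price"].max())    # largest value in the Price column
```

Each dictionary key becomes a column, much like a header cell in a spreadsheet, and each list supplies that column's rows.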

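As an example, we'll look at data about home prices in Melbourne, Australia. In the hands-on exercises, you will apply the same process to a new dataset, which has home prices in Iowa.

The example (Melbourne) data is at the file path ../input/melbourne-housing-snapshot/melb_data.csv

We load and explore the data with the following commands: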

python
# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
# print a summary of the data in Melbourne data
melbourne_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.000000 | 1.358000e+04 | 13580.000000 | 13580.000000 | 13580.000000 | 13580.000000 | 13518.000000 | 13580.000000 | 7130.000000 | 8205.000000 | 13580.000000 | 13580.000000 | 13580.000000 |
| mean | 2.937997 | 1.075684e+06 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.967650 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 6.393107e+05 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.079260 | 0.103916 | 4378.581772 |
| min | 1.000000 | 8.500000e+04 | 0.000000 | 3000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1196.000000 | -38.182550 | 144.431810 | 249.000000 |
| 25% | 2.000000 | 6.500000e+05 | 6.100000 | 3044.000000 | 2.000000 | 1.000000 | 1.000000 | 177.000000 | 93.000000 | 1940.000000 | -37.856822 | 144.929600 | 4380.000000 |
| 50% | 3.000000 | 9.030000e+05 | 9.200000 | 3084.000000 | 3.000000 | 1.000000 | 2.000000 | 440.000000 | 126.000000 | 1970.000000 | -37.802355 | 145.000100 | 6555.000000 |
| 75% | 3.000000 | 1.330000e+06 | 13.000000 | 3148.000000 | 3.000000 | 2.000000 | 2.000000 | 651.000000 | 174.000000 | 1999.000000 | -37.756400 | 145.058305 | 10331.000000 |
| max | 10.000000 | 9.000000e+06 | 48.100000 | 3977.000000 | 20.000000 | 8.000000 | 10.000000 | 433014.000000 | 44515.000000 | 2018.000000 | -37.408530 | 145.526350 | 21650.000000 |

Interpreting Data Description

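The results show 8 numbers for each numeric column in your original dataset. The first number, the count, shows how many rows have non-missing values.

Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.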
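
The count row is why BuildingArea shows 7130 in the Melbourne summary while most columns show 13580. A minimal sketch of the same effect, using a made-up Series (`bedroom2_size` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical second-bedroom sizes; NaN marks a value that was never collected.
bedroom2_size = pd.Series([12.0, np.nan, 9.5, np.nan, 11.0])

total_rows = len(bedroom2_size)        # counts every row, missing or not
non_missing = bedroom2_size.count()    # counts only non-missing values
missing = bedroom2_size.isna().sum()   # counts only missing values

print(total_rows, non_missing, missing)
```

The value reported as count by describe() corresponds to `non_missing` here, not to the total number of rows.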

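The second value is the mean, which is the average. Under that, std is the standard deviation, which measures how numerically spread out the values are.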
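
One detail worth knowing is that pandas computes the sample standard deviation by default (dividing by n - 1). The sketch below checks this on a small set of made-up numbers; the variable names are illustrative only.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical numbers

mean = values.mean()
sample_std = values.std()            # pandas default: divides by n - 1
population_std = values.std(ddof=0)  # divides by n instead

print(mean, sample_std, population_std)
```

The std row of describe() matches `sample_std`, so it comes out slightly larger than the population figure on small datasets.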

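To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.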
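
The sorting picture can be checked directly. This is a minimal sketch on five made-up values, chosen so that each percentile lands exactly on an element. When a percentile falls between two values, pandas interpolates linearly by default.

```python
import pandas as pd

values = pd.Series([50, 10, 40, 20, 30])  # hypothetical, deliberately unsorted

ordered = values.sort_values().tolist()   # lowest to highest
q25 = values.quantile(0.25)               # a quarter of the way through
q50 = values.quantile(0.50)               # halfway through (the median)
q75 = values.quantile(0.75)               # three quarters of the way through
summary = values.describe()               # contains the same percentiles

print(ordered, q25, q50, q75)
```

Here `summary["25%"]`, `summary["50%"]` and `summary["75%"]` hold the same numbers as `q25`, `q50` and `q75`.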

Exercise: Explore Your Data

Step 1: Loading Data

Read the Iowa data file into a Pandas DataFrame called home_data.

python
import pandas as pd

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# Fill in the line below to read the file into a variable home_data
home_data = pd.read_csv(iowa_file_path)

# Call line below with no argument to check that you've loaded the data correctly
step_1.check()

Step 2: Review The Data

Use the command you learned to view summary statistics of the data. Then fill in variables to answer the following questions.

python
# Print summary statistics in next line
home_data.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.000000 | 1460.000000 | 1201.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1452.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 |
| mean | 730.500000 | 56.897260 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | 94.244521 | 46.660274 | 21.954110 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.195890 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.000000 | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 | 34900.000000 |
| 25% | 365.750000 | 20.000000 | 59.000000 | 7553.500000 | 5.000000 | 5.000000 | 1954.000000 | 1967.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 2007.000000 | 129975.000000 |
| 50% | 730.500000 | 50.000000 | 69.000000 | 9478.500000 | 6.000000 | 5.000000 | 1973.000000 | 1994.000000 | 0.000000 | 383.500000 | 0.000000 | 25.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 | 163000.000000 |
| 75% | 1095.250000 | 70.000000 | 80.000000 | 11601.500000 | 7.000000 | 6.000000 | 2000.000000 | 2004.000000 | 166.000000 | 712.250000 | 168.000000 | 68.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 | 214000.000000 |
| max | 1460.000000 | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | 857.000000 | 547.000000 | 552.000000 | 508.000000 | 480.000000 | 738.000000 | 15500.000000 | 12.000000 | 2010.000000 | 755000.000000 |


python
# What is the average lot size (rounded to nearest integer)?
avg_lot_size = round(home_data['LotArea'].mean())

# As of today, how old is the newest home (current year - the date in which it was built)
newest_home_age = 2024 - round(home_data['YearBuilt'].max())

# Checks your answers
step_2.check()

Analyze:

Question: What is the average lot size (rounded to nearest integer)?

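When this question is translated into Chinese, it can come out as a literal **"average plot size (rounded to the nearest integer)"**, which leaves it unclear where to start. What the question actually asks for is the average area of the lots, that is, the mean of the LotArea column, rounded to the nearest integer.

Earlier we used pandas to load the training dataset home-data-for-ml-course/train.csv and printed its summary statistics on the next line.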

python
import pandas as pd

# Path of the dataset to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# Read the file into a variable named home_data
home_data = pd.read_csv(iowa_file_path)

# Print summary statistics in next line
home_data.describe()

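Clearly, we need to read out the value at a particular row and column. How do we do that?

Strictly speaking, home_data has no such row. The row labels in the printed summary, such as count, mean and std, are statistics that describe() computed from the data; they are not rows of home_data itself. In home_data we can only select by column name. Fortunately, we don't need to locate a particular row: all we want is a summary statistic of a column, and we can compute that from the column directly.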
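
For comparison, the table that describe() returns is itself a DataFrame, so a cell in it can also be looked up by row label and column name. The sketch below shows both routes on a small stand-in; `toy_home_data` and its values are hypothetical and are not the Iowa data.

```python
import pandas as pd

# Hypothetical stand-in for home_data with made-up values.
toy_home_data = pd.DataFrame({
    "LotArea": [8000, 9500, 11000],
    "YearBuilt": [1960, 1995, 2008],
})

stats = toy_home_data.describe()

mean_from_table = stats.loc["mean", "LotArea"]      # look up the cell in the summary
mean_from_column = toy_home_data["LotArea"].mean()  # compute directly from the column
newest_year = toy_home_data["YearBuilt"].max()      # most recent construction year

print(mean_from_table, mean_from_column, newest_year)
```

Computing from the column is usually the simpler choice, since it avoids building the whole summary table just to read one number.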

python
# home_data is a pandas DataFrame; home_data['LotArea'] selects its LotArea column.
# mean() computes the column's average and, likewise, max() returns its largest value.
# What is the average lot size (rounded to nearest integer)?
avg_lot_size = round(home_data['LotArea'].mean())
# As of today, how old is the newest home (current year - the date in which it was built)
# The largest value in YearBuilt is the latest construction year, i.e. the newest home.
# Its age is measured from the current year, taken here as 2024.
newest_home_age = 2024 - round(home_data['YearBuilt'].max())
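
The hard-coded 2024 goes stale once the year changes. One possible alternative, sketched here under assumptions, is to read the current year from the system clock. The `year_built` Series is a hypothetical stand-in for home_data['YearBuilt'], and whether the course's step_2.check() accepts a clock-based answer is not verified here.

```python
from datetime import date

import pandas as pd

# Hypothetical stand-in for home_data['YearBuilt'].
year_built = pd.Series([1872, 1973, 2010])

current_year = date.today().year                        # taken from the system clock
newest_home_age = current_year - int(year_built.max())  # age of the newest home

print(newest_home_age)
```

The int() call converts the NumPy integer returned by max() into a plain Python int, which also makes the round() in the original line unnecessary for a column of whole years.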